# Machine Problem 2

## Joel Coghlin – 20228087

## Full Coda Program – With Updated Kernel

\**Note that TILE\_WIDTH and the matrix width were modified for each experiment*\*

//Joel Coghlin - 20228087

//Machine Problem 2

#include "cuda\_runtime.h"

#include "device\_launch\_parameters.h"

#include <stdio.h>

#include <cstdlib> // Include for rand() function

#include <random>

#define TILE\_WIDTH 5

/\*/ Old Kernel function to perform matrix multiplication on GPU

\_\_global\_\_ void matrixMul(float\* A, float\* B, float\* C, int width) {

int row = blockIdx.y \* blockDim.y + threadIdx.y;

int col = blockIdx.x \* blockDim.x + threadIdx.x;

if (row < width && col < width) {

float sum = 0.0;

for (int k = 0; k < width; k++) {

sum += A[row \* width + k] \* B[k \* width + col];

}

C[row \* width + col] = sum;

}

}

\*/

\_\_global\_\_ void matrixMul(float\* A, float\* B, float\* C, int width) {

\_\_shared\_\_ float share\_A[TILE\_WIDTH \* TILE\_WIDTH];

\_\_shared\_\_ float share\_B[TILE\_WIDTH \* TILE\_WIDTH];

int tx = threadIdx.x;

int ty = threadIdx.y;

int bx = blockIdx.x;

int by = blockIdx.y;

int row = by \* blockDim.y + ty; //indices

int col = bx \* blockDim.y +tx;

float sum = 0.0;

if (row >= width || col >= width) return;

for (int m = 0; m < (width-1)/TILE\_WIDTH + 1; ++m) {

int A\_index = row \* width + m \* TILE\_WIDTH + tx; //load the matrices into shared memory arrays.

int B\_index = (m \* TILE\_WIDTH + ty) \* width + col;

share\_A[ty \* TILE\_WIDTH + tx] = A[A\_index];

share\_B[tx \* TILE\_WIDTH + ty] = B[B\_index];

//check if the shared indexing of A tile is out of bounds (row matrix)

if (m \* TILE\_WIDTH + tx < width) {

share\_A[ty \* TILE\_WIDTH + tx] = A[A\_index];

}

else {

share\_A[ty \* TILE\_WIDTH + tx] = 0.0;

}

//check the shared indexing of the B tile as well (column matrix)

if (m \* TILE\_WIDTH + ty < width) {

share\_B[ty \* TILE\_WIDTH + tx] = B[B\_index];

}

else {

share\_B[ty \* TILE\_WIDTH + tx] = 0.0;

}

\_\_syncthreads();

for (int k = 0; k < TILE\_WIDTH; ++k) {

if(m\*TILE\_WIDTH +k < width)

sum += share\_A[ty\*TILE\_WIDTH + k]\*share\_B[k\*TILE\_WIDTH + tx]; //do mat mul with these new arrays

}

\_\_syncthreads();

}

C[row \* width + col] = sum;

}

// Function to perform matrix multiplication on CPU

void matrixMulCPU(float\* A, float\* B, float\* C, int width) {

for (int i = 0; i < width; i++) {

for (int j = 0; j < width; j++) {

float sum = 0.0;

for (int k = 0; k < width; k++) {

sum += A[i \* width + k] \* B[k \* width + j];

}

C[i \* width + j] = sum;

}

}

}

int main() {

int numDevices;

cudaGetDeviceCount(&numDevices);

printf("There are/is %d device(s) available\n", numDevices);

cudaDeviceProp deviceProperties;

cudaGetDeviceProperties(&deviceProperties, 0);

for (int i = 0; i < numDevices; i++) {

printf("Device Number (on system): %d\n", i);

printf("Device Clock Rate: %d kHz\n", deviceProperties.clockRate);

printf("Number of SMs: %d\n", deviceProperties.multiProcessorCount);

printf("Number of Cores: \n");

printf("Warp Size: %d\n", deviceProperties.warpSize);

printf("Total Global Memory: %zu Bytes\n", deviceProperties.totalGlobalMem);

printf("Total Constant Memory: %zu Bytes\n", deviceProperties.totalConstMem);

printf("Shared Memory Per Block: %zu Bytes\n", deviceProperties.sharedMemPerBlock);

printf("Registers Available Per Block: %d\n", deviceProperties.regsPerBlock);

printf("Max Threads Per Block: %d\n", deviceProperties.maxThreadsPerBlock);

printf("Max Size of dimBlock: %d, %d, %d\n", deviceProperties.maxThreadsDim[0], deviceProperties.maxThreadsDim[1], deviceProperties.maxThreadsDim[2]);

printf("Max Size of dimGrid: %d, %d, %d\n", deviceProperties.maxGridSize[0], deviceProperties.maxGridSize[1], deviceProperties.maxGridSize[2]);

}

cudaSetDevice(0);

// Matrix size

int width = 10;

int size = width \* width \* sizeof(float);

// Allocate memory for host matrices

float\* h\_A = 0;

float\* h\_B = 0;

float\* h\_C = 0;

float\* h\_C\_CPU = 0; // For CPU computation

cudaMallocHost((void\*\*)&h\_A, size);

cudaMallocHost((void\*\*)&h\_B, size);

cudaMallocHost((void\*\*)&h\_C, size);

cudaMallocHost((void\*\*)&h\_C\_CPU, size);

/// Seed the random number generator

srand(time(nullptr));

/\*// Initialize host matrices with random numbers from 1 to 10

for (int i = 0; i < width \* width; i++) {

h\_A[i] = static\_cast<float>(rand() % 10 + 1); // Random number from 1 to 10

h\_B[i] = static\_cast<float>(rand() % 10 + 1); // Random number from 1 to 10

}

\*/

for (int i = 0; i < width \* width; i++) {

h\_A[i] = (float)rand() / RAND\_MAX;

}

for (int i = 0; i < width \* width; i++) {

h\_B[i] = (float)rand() / RAND\_MAX;

}

for (int i = 0; i < width \* width; i++) {

h\_C[i] = 0;

h\_C\_CPU[i] = 0;

}

// Allocate memory for device matrices

float\* d\_A, \* d\_B, \* d\_C;

cudaMalloc((void\*\*)&d\_A, size);

cudaMalloc((void\*\*)&d\_B, size);

cudaMalloc((void\*\*)&d\_C, size);

cudaEvent\_t start, stop;

cudaEventCreate(&start);

cudaEventCreate(&stop);

// Copy host matrices to device

cudaEventRecord(start, 0);

cudaMemcpy(d\_A, h\_A, size, cudaMemcpyHostToDevice);

cudaMemcpy(d\_B, h\_B, size, cudaMemcpyHostToDevice);

cudaEventRecord(stop, 0);

cudaEventSynchronize(stop);

float milliseconds = 0; //to Hold the time elapsed

cudaEventElapsedTime(&milliseconds, start, stop);

printf("copying host to device took: %f\n", milliseconds);

//Define grid and block dimensions

dim3 dimGrid = dim3((width + TILE\_WIDTH - 1) / TILE\_WIDTH, (width + TILE\_WIDTH - 1) / TILE\_WIDTH, 1);

dim3 dimBlock = dim3(TILE\_WIDTH, TILE\_WIDTH, 1);

//Launch kernel

cudaEventRecord(start, 0);

matrixMul << <dimGrid, dimBlock >> > (d\_A, d\_B, d\_C, width);

cudaEventRecord(stop, 0);

cudaEventSynchronize(stop);

cudaEventElapsedTime(&milliseconds, start, stop);

printf("GPU Kernel took: %f\n", milliseconds);

cudaMemcpy(h\_C, d\_C, size, cudaMemcpyDeviceToHost);

/\* Print GPU result (used in debugging)

printf("\nGPU Result:\n");

for (int i = 0; i < width; ++i) {

for (int j = 0; j < width; ++j) {

printf("%.3f ", h\_C[i \* width + j]);

}

printf("\n");

}\*/

// Perform matrix multiplication on CPU for comparison

//cudaEvent\_t start, stop;

cudaEventRecord(start, 0);

matrixMulCPU(h\_A, h\_B, h\_C\_CPU, width);

cudaEventRecord(stop, 0);

cudaEventSynchronize(stop);

cudaEventElapsedTime(&milliseconds, start, stop);

printf("The CPU took: %f\n", milliseconds);

//Verify

for (int i = 0; i < width \* width; i++) {

if (fabs(h\_C[i] - h\_C\_CPU[i]) > 1) {

printf("Test FAILED\n");

//break;

}

}

for (int e = 0; e < width \* width; e++) {

printf(" %.3f\n", h\_C[e]);

}

for (int w = 0; w < width \* width; w++) {

printf(" %.3f\n", h\_C\_CPU[w]);

}

/\* Print CPU result (also used in debugging)

printf("CPU Result:\n");

for (int i = 0; i < width; ++i) {

for (int j = 0; j < width; ++j) {

printf("%.3f ", h\_C\_CPU[i \* width + j]);

}

printf("\n");

}\*/

cudaEventRecord(start, 0);

cudaMemcpy(h\_A, d\_A, size, cudaMemcpyDeviceToHost);

cudaMemcpy(h\_B, d\_B, size, cudaMemcpyDeviceToHost);

cudaEventRecord(stop, 0);

cudaEventSynchronize(stop);

cudaEventElapsedTime(&milliseconds, start, stop);

printf("copying device to host took: %f\n", milliseconds);

cudaEventDestroy(start);

cudaEventDestroy(stop);

//Copy result matrix from device to host

//cudaEvent\_t start, stop;

//Free device memory

cudaFree(d\_A);

cudaFree(d\_B);

cudaFree(d\_C);

//Free host memory

cudaFreeHost(h\_A);

cudaFreeHost(h\_B);

cudaFreeHost(h\_C);

cudaFreeHost(h\_C\_CPU);

return 0;

}

## Results

I tested, recorded then plotted the results for my new tiling kernel. I expected the use of shared memory to decrease the duration of computing each matrix multiplication. The results are outlined below in Table 1. The baseline matrix multiplication kernel is identified as *Machine Problem 1* and the new shared memory kernel is identified as *Machine Problem 2*. Results for *Machine Problem 1* kernel were pulled from my results from my submission. *Machine Problem 2* data was retrieved by testing the program and altering the TILE\_WIDTH and matrix size. Note that computation time does not include the time required for transferring data to and from the host. However, it **does** include the time taken to place results in shared memory.

Table 1: Tiling Kernel Matrix Multiplication Results

|  |  |  |  |  |  |  |
| --- | --- | --- | --- | --- | --- | --- |
| **Kernel** | **Specifics** | **Duration with Respect to Matrix Size (ms)** | | | | |
| **100** | **250** | **500** | **1000** | **1500** |
| Machine Problem 1 | Standard Block Width of 32 | 0.0123 | 0.459 | 2.836 | 27.731 | 86.605 |
| Machine Problem 2 | TILE\_WIDTH = 2 | 3.571 | 36.199 | 275.45 | 1514.448 | 4839.934 |
| TILE\_WIDTH = 5 | 1.143 | 5.471 | 29.063 | 194.272 | 581.776 |
| TILE\_WIDTH = 10 | 0.212 | 1.395 | 9.018 | 74.067 | 242.475 |
| TILE\_WIDTH = 25 | 0.183 | 0.891 | 7.51 | 55.014 | 175.307 |

This table is be visualized in Figure 1 and Figure 2. In Figure 1, the computation speed for each TILE\_WIDTH is compared with the matrix size. Increasing the tile width showed an immense improvement in computation speed. This is because the kernel spends more time computing the result entries rather than transferring data to shared memory and global memory. In the case that the TILE\_WIDTH is 2, the kernel is spending an unnecessary amount of time transferring data instead of computing the result.

Figure 1: GPU Tiling Kernel Computation Time vs. Matrix Size

Figure 2 shows the baseline kernel’s computation time against each matrix size. It appears to follow the same trend as the trendlines in Figure 1, however, it is important to note that the scale of the computation time axis in Figure 2 only displays up to 100ms. This shows that the baseline matrix multiplication is substantially better than the tiling kernel at TILE\_WIDTH = 2.

Figure 2: Baseline Matrix Multiplication Kernel from MP1

## Comparing With Machine Problem 1 and Discussion Questions

The differences in results for each kernel were – in my opinion – unexpected. I was expecting to achieve greater results with the tiling kernel since it is trying to save time in memory allocation. However, the tiling algorithm does take slightly longer than the baseline kernel from MP1. I thought that this may be caused by the frequent allocation of memory compared to the actual computations. In the baseline kernel, the block width is set and each thread computes an element of the result matrix. This is then written to global memory in one step. In the tiling kernel, while it is true that transferring data from shared memory to global memory is faster, when the TILE\_WIDTH is small, it is required to happen more often. Even if the computation themselves are just as fast, the excessive memory transfers make the timestamps appear longer in this case.